Abstract

We provide a variable metric stochastic ap- 
proximation theory. In doing so, we pro- 
vide a convergence theory for a large class 
of online variable metric methods including 
the recently introduced online versions of 
the BFGS algorithm and its limited-memory 
LBFGS variant. We also discuss the implica- 
tions of our results for learning from expert 
advice. 



1 Introduction 

We begin by introducing online optimization and in 
particular stochastic gradient descent methods. 

1.1 Online Optimization 

There is an abundance of applications that lead to online optimization problems, where we try to find the minimum of a data-dependent function $C : \mathbb{R}^k \to \mathbb{R}$, $w \mapsto C(w) = C(w, Z)$, where $Z$ represents data or a probability distribution that generates data. During an online optimization procedure, a sequence of parameters $w_t$, $t = 1, 2, \ldots$ is created by an update rule that defines $w_{t+1}$ from the earlier parameters and the new data $z_t$ that has arrived at time $t$. Given a sequence of data samples $z_1, \ldots, z_t$ drawn from a fixed distribution, we will use the notation $C_t(\cdot) = \frac{1}{t} \sum_{i=1}^{t} C(\cdot, z_i)$ for the empirical objective, which is the average of the online/stochastic/instantaneous objectives $C(\cdot, z_i)$. We will refer to $C(\cdot) = \mathbb{E}_z C(\cdot, z)$ as the true objective.



*Also an adjunct at the second author's affiliation. 



Corrects Theorem 3.4 in Article Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. 2009.



1.2 Stochastic Gradient Descent 

If one has defined a metric $\kappa$ on the parameter space by supplying a dot product, and if the online objectives are differentiable, then we can use the gradients with respect to that metric to define the update equation

$$w_{t+1} = w_t - a_t \nabla^{\kappa} C(w_t, z_t), \quad a_t > 0. \qquad (1)$$



If one uses metrics defined by different dot products for different $t$, then one can let $\nabla_w$ denote the gradient with respect to the standard Euclidean dot product and instead let the updates take the form

$$w_{t+1} = w_t - a_t B_t \nabla_w C(w_t, z_t), \qquad (2)$$



where the $B_t$ are positive definite and symmetric matrices. One example of considering variable metrics is the study of information geometry (Amari and Nagaoka, 1993), where the Fisher information matrix is used to define a metric tensor on a family of probability distributions. More specific examples will be provided in Section 5 below.
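As a minimal sketch of update (2), the following Python snippet applies one scaled gradient step. The function name `scaled_sgd_step` and the quadratic example objective are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def scaled_sgd_step(w, grad, B, a):
    """One step of update (2): w_{t+1} = w_t - a_t * B_t @ grad."""
    return w - a * (B @ grad)

# Hypothetical instantaneous objective C(w, z) = 0.5 * ||w - z||^2,
# whose Euclidean gradient with respect to w is simply w - z.
w = np.array([1.0, -2.0])
z = np.array([0.0, 0.0])
B = np.diag([2.0, 0.5])            # symmetric positive definite scaling matrix
w_scaled = scaled_sgd_step(w, w - z, B, a=0.1)
w_plain = scaled_sgd_step(w, w - z, np.eye(2), a=0.1)   # B_t = I: ordinary SGD
```

With $B_t = I$ the step is ordinary stochastic gradient descent; any other symmetric positive definite $B_t$ rescales and rotates the step direction.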

1.3 Outline and Summary 

We investigate the theoretical foundations for using online updates that include scaling matrices, that is, stochastic gradient descent where the gradients are taken with respect to time-varying metrics. Among other results, this provides a convergence proof for a large class of variable metric methods including the recent online (L)BFGS algorithm (Schraudolph et al., 2007). In Section 3 we employ the Robbins-Siegmund theorem to prove $O(1/t)$ convergence in function values $C(w_t)$ for a class of functions that is useful in machine learning. This is the best possible rate (Bottou and LeCun, 2005; Amari, 1998; Murata, 1998), limited only by the rate at which information arrives. Under weaker assumptions we show almost sure convergence without rate for a larger class. Our results extend those in Bottou and LeCun (2005) by not demanding that the metrics converge to an asymptotic metric.

We first introduce motivating application areas in Section 2. Then in Section 3 we provide both a background in stochastic approximation theory and our new theorems. We apply our theorems to prove convergence for online BFGS in Section 4. Furthermore, we consider implications in the area of learning from expert advice in Section 5. The paper concludes with a discussion in Section 6.

2 Examples 

In this section we provide different objective functions 
and describe the problem settings they appear in. 

2.1 Online Risk Minimization 

The goal of many machine learning algorithms is to minimize the risk $\mathbb{E}_z\, l(\langle w, x \rangle, y)$, where the expectation is taken with respect to a fixed but unknown probability distribution $Z$ that generates instance-label pairs $z := (x, y)$. The loss $l$ is a non-negative convex function of the parameters $w$ and measures the discrepancy between the labels $y$ and the predictions arising from $x$ and $w$ via their inner product $\langle w, x \rangle$ (often the Euclidean dot product if $x, w \in \mathbb{R}^k$).

In the absence of complete knowledge about the underlying distribution, an empirical sample $Z = \{(x_i, y_i), i = 1, \ldots, n\}$¹ is often used to minimize the regularized empirical risk

$$\frac{c}{2}\|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} l(\langle w, x_i \rangle, y_i), \qquad (3)$$

where the L2-regularization term $\frac{c}{2}\|w\|^2$ is introduced for well-posedness. With larger $n$ one can use smaller $c > 0$. The true regularized risk that is estimated by (3) is $\frac{c}{2}\|w\|^2 + \mathbb{E}_z\, l(\langle w, x \rangle, y)$.

¹With some abuse of notation we use $Z$ to represent either a data set or a distribution that generates data.

Batch optimization algorithms, including quasi-Newton and bundle-based methods, are available and widely used to minimize (3), but they are computationally expensive. Gradient-based batch methods may also fail to converge if the loss is non-smooth. Therefore, online optimization methods that work with small subsamples of training data have received considerable attention recently (Kivinen and Warmuth, 1997; Schraudolph, 2002; Azoury and Warmuth, 2001; Shalev-Shwartz et al., 2007; Schraudolph et al., 2007).

In the online setting, we replace the objective (3) with approximations based on subsets ("mini-batches") of the samples:

$$\frac{c}{2}\|w\|^2 + \frac{1}{b} \sum_{(x, y) \in Z_t} l(\langle w, x \rangle, y), \qquad (4)$$

where $Z_t \subset Z$ with $|Z_t| = b \ll n$. Furthermore, in online learning we can consider using $c = 0$ and thereby aiming directly at minimizing the true objective $\mathbb{E}_z\, l(\langle w, x \rangle, y)$. During online optimization, a sequence of parameters $w_t$, $t = 1, 2, \ldots$ arises from an update rule that computes $w_{t+1}$ from the previous state and the new information at time $t$. In addition to alleviating the high computational cost of batch methods, the online setting also arises when the data itself is streaming, that is, we are receiving partial information about $C(\cdot)$ in a sequence of small packages.
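The regularized mini-batch objective can be sketched in a few lines of Python. The helper names, the squared loss, and the synthetic data below are assumptions chosen for illustration; the paper leaves the convex loss $l$ generic.

```python
import numpy as np

def minibatch_objective(w, X, y, c, loss):
    """(c/2)*||w||^2 plus the average of loss(<w, x>, y) over the rows of X."""
    preds = X @ w
    return 0.5 * c * (w @ w) + np.mean(loss(preds, y))

def squared_loss(pred, y):
    # Illustrative convex loss; any non-negative convex loss could be used.
    return 0.5 * (pred - y) ** 2

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, 0.5])
X = rng.normal(size=(1000, 3))                      # n = 1000 instances
y = X @ w_true                                      # noise-free labels
batch = rng.choice(1000, size=10, replace=False)    # Z_t with b = 10 << n

w = np.zeros(3)
full = minibatch_objective(w, X, y, 0.1, squared_loss)                # objective (3)
mini = minibatch_objective(w, X[batch], y[batch], 0.1, squared_loss)  # objective (4)
```

Passing the whole sample gives the empirical risk (3), passing a small subset gives the mini-batch approximation (4), and setting `c = 0.0` removes the regularizer.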

Including second-order information in the online optimization procedure can accelerate the convergence (Schraudolph et al., 2007). This is particularly true in a setting where only one pass through a dataset is performed. Bottou and LeCun (2005) and Murata (1998) point out that minimizing the empirical objective is different from minimizing the true objective, and show that the result of an online second-order gradient descent procedure can be as close to the true optimum as the minimum of the empirical objective.

2.2 Filtering 

The goal in filtering is to separate the signal from the noise in a stream of data. Kalman algorithms in particular use the Euclidean distance (sum-squared loss) and track the minimizer of the empirical objectives $C_t(w) = \sum_{i=1}^{t} (y_i - w \cdot x_i)^2$. The inverse of the Hessian of $C_t$ is used as $B_t$, with $a_t = 1$ and $w_0 = 0$, for the update in (2). $B_{t+1}$ is found from $B_t$ with an update whose cost is of order $k^2$. The result is that $w_t = \arg\min_w C_t(w)$. Therefore, if we have a fixed distribution the sequence will converge to the optimal parameters.
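The following sketch illustrates this exact-tracking property with a recursive least-squares recursion. It is written under one assumption that is not in the text: a small ridge term `delta` is added so that the Hessian is invertible from the first sample, which means the recursion tracks the minimizer of $C_t(w) + \delta\|w\|^2$ rather than of $C_t$ itself. The function name `rls` is hypothetical.

```python
import numpy as np

def rls(X, y, delta=1e-2):
    """Recursive least squares started from w_0 = 0 and P_0 = I / delta."""
    k = X.shape[1]
    w = np.zeros(k)
    P = np.eye(k) / delta                 # inverse of (delta*I + sum of x x^T)
    for x_t, y_t in zip(X, y):
        Px = P @ x_t
        # Sherman-Morrison rank-one update of the inverse, O(k^2) per step.
        P = P - np.outer(Px, Px) / (1.0 + x_t @ Px)
        # Scaled gradient step with unit gain on the newest squared error.
        w = w + (P @ x_t) * (y_t - x_t @ w)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_rls = rls(X, y, delta=1e-2)
# Batch minimizer of sum_i (y_i - w.x_i)^2 + delta * ||w||^2 for comparison.
w_batch = np.linalg.solve(1e-2 * np.eye(3) + X.T @ X, X.T @ y)
```

Up to floating-point error, the online iterate after the last sample coincides with the batch solution, even though each step only touches one example.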

The same algorithm can be extended to a more general setting where the sum-squared loss is replaced by arbitrary convex functions. The resulting algorithm was called the online Newton-step algorithm by Hazan et al. (2007) and described as an approximate "follow the leader" algorithm, i.e., an algorithm that approximately follows the optimum that the Kalman filter tracks exactly for the sum-squared loss.

2.3 Learning from Expert Advice with 
Bregman Divergences 

The general convex optimization framework by Zinkevich (2003) that Hazan et al. (2007) worked in is also related to the expert advice framework by Azoury and Warmuth (2001). In the expert advice framework one encounters a sequence of loss functions $L_t$ ($L_t(w) = C(w, z_t)$ here) and one wants to perform the implicitly defined update

$$w_{t+1} = \arg\min_w \; \Delta_H(w, w_t) + a_t L_t(w), \qquad (5)$$

where $\Delta_H$ is the Bregman divergence

$$\Delta_H(p, q) = H(p) - H(q) - \nabla H(q) \cdot (p - q) \qquad (6)$$

defined by the differentiable convex function $H$. The squared Euclidean distance is an example of a Bregman divergence, corresponding to $H(w) = \frac{1}{2}\|w\|_2^2$. The goal for Azoury and Warmuth (2001), as it was for Zinkevich (2003), was to have a small total accumulated loss $\sum_{j=1}^{t} L_j(w_j)$ and in particular small regret, $\sum_{j=1}^{t} L_j(w_j) - \min_w \sum_{j=1}^{t} L_j(w)$. To derive an explicit update, Azoury and Warmuth (2001) differentiate the expression to be minimized,

$$\nabla\big(\Delta_H(w, w_t) + a_t L_t(w)\big) = h(w) - h(w_t) + a_t \nabla L_t(w), \qquad (7)$$

where $h := \nabla H$, and by using the approximation $\nabla L_t(w) \approx \nabla L_t(w_t)$, they arrive at the updates

$$w_{t+1} = h^{-1}\big(h(w_t) - a_t \nabla L_t(w_t)\big). \qquad (8)$$

This reduces to online gradient descent, $w_{t+1} = w_t - a_t \nabla L_t(w_t)$, for the case $h(w) = w$, i.e., when $H(w) = \frac{1}{2}\|w\|_2^2$.

Azoury and Warmuth (2001) in particular consider loss functions of a special form, namely the case where $L_t(w) = \Delta_G(x_t \cdot w, g^{-1}(y_t))$, where $g = \nabla G$ is an increasing continuous function from $\mathbb{R}$ to itself and where $(x_t, y_t) = z_t$ is an example. This results in simpler updates since it implies that $\nabla L_t(w) = (g(w \cdot x_t) - y_t) x_t$. Note that, given a parametrization of $\mathbb{R}^d$, a one-dimensional transfer function $g$ can be used to define a $d$-dimensional continuous bijection by applying it coordinate-wise.
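The explicit update through a transfer function $h$ can be sketched as below. The function name `transfer_step` and the two choices of $h$ are illustrative assumptions: the identity recovers plain gradient descent, while the coordinate-wise logarithm (for positive parameters) turns the additive step into a multiplicative one.

```python
import numpy as np

def transfer_step(w, grad, a, h, h_inv):
    """Explicit update: w_{t+1} = h^{-1}(h(w_t) - a_t * grad L_t(w_t))."""
    return h_inv(h(w) - a * grad)

w = np.array([0.2, 0.5, 0.3])
grad = np.array([1.0, -1.0, 0.0])    # stand-in for grad L_t(w_t)

# h(w) = w, i.e. H(w) = 0.5 * ||w||^2: ordinary online gradient descent.
w_gd = transfer_step(w, grad, 0.1, lambda v: v, lambda v: v)

# h(w) = log(w) applied coordinate-wise, i.e. H(w) = sum(w*log(w) - w) for w > 0:
# the same rule now multiplies each coordinate by exp(-a * grad).
w_mult = transfer_step(w, grad, 0.1, np.log, np.exp)
```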

It is interesting to compare (5) to a reparametrization where one makes a coordinate change $\gamma = h(w)$. Then (8) can be written as

$$h(w_{t+1}) = h(w_t) - a_t \nabla_w L_t(w_t). \qquad (9)$$

The one obstacle to identifying (9) with

$$\gamma_{t+1} = \gamma_t - a_t \nabla_\gamma \tilde{L}_t(\gamma_t), \qquad (10)$$

where $\tilde{L}_t(\gamma) = L_t(h^{-1}(\gamma))$, is that the gradient of $L_t$ is taken with respect to $w$. We know that $\nabla_\gamma \tilde{L}_t(\gamma_t) = B_t \nabla_w L_t(w_t)$, where $B_t$ is the inverse Hessian of $H$ at $w_t$, in other words the inverse Jacobian of $h$.
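The chain-rule identity between the two gradients can be checked numerically. In the sketch below, the quadratic loss, the transfer function $h = \sinh$ (the gradient of the convex function $H(w) = \sum_i \cosh(w_i)$), and all variable names are assumptions picked so that $h^{-1}$ and the Jacobian have closed forms.

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def L(w):
    r = A @ w - b
    return 0.5 * (r @ r)

def grad_L(w):
    return A.T @ (A @ w - b)

h, h_inv = np.sinh, np.arcsinh       # h = grad H for H(w) = sum(cosh(w_i))

def L_tilde(gamma):
    """Loss expressed in the new coordinates gamma = h(w)."""
    return L(h_inv(gamma))

w = np.array([0.3, -0.7])
gamma = h(w)
B = np.diag(1.0 / np.cosh(w))        # inverse Jacobian of h at w

# Central finite differences for the gradient of L_tilde with respect to gamma.
eps = 1e-6
num_grad = np.array([
    (L_tilde(gamma + eps * e) - L_tilde(gamma - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
chain_rule_grad = B @ grad_L(w)
```

The finite-difference gradient in the $\gamma$ coordinates agrees with $B_t \nabla_w L_t(w_t)$ up to discretization error.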

A reason for doing a reparametrization is that the loss functions $L_t$ might not satisfy the conditions that are needed for convergence of stochastic gradient descent, but it can be possible to find a transfer function $h$ such that the reparametrized loss $\tilde{L}_t$ does. In that sense, the update (8) can be a way of trying to do this even if we only have the gradients $\nabla_w L_t$ and do not want to calculate the matrices $B_t$. Our main theorems will tell us when it is acceptable to omit $B_t$, and thereby use an SGD update with respect to varying metrics.

3 Stochastic Approximation Theory 



Robbins and Monro (1951) proved a theorem that implies convergence for one-dimensional stochastic gradient descent; Blum (1954) generalized it to the multivariate case. Robbins and Siegmund (1971) achieved a stronger result of wider applicability in supermartingale theory. Here we extend the known convergence results (Bottou and LeCun, 2005) in two ways: a) we prove that updates that include scaling matrices with eigenvalues bounded by positive constants from above and below will converge almost surely; b) under slightly stronger assumptions we obtain a $O(1/t)$ rate of convergence in the function values.

3.1 The Multivariate Robbins-Monro 
Procedure 

Suppose that $\nabla C = f : \mathbb{R}^k \to \mathbb{R}^k$ is an unknown continuous function that we want to find a root of. Furthermore, suppose that there is a unique root and denote it by $w^*$. Given an initial estimate $w_1$, the procedure constructs a sequence of estimates $w_t$ such that $w_t \to w^*$ as $t \to \infty$. For any random vector $X$ let $\mathbb{E}_t(X)$ be the conditional expectation given $w_1, \ldots, w_t$. Given $w_1, \ldots, w_t$, we assume that we observe an unbiased estimate $Y_t$ of $f(w_t) = \nabla C(w_t)$, i.e., $\mathbb{E}_t Y_t = f(w_t)$. Given $Y_t$ and $w_t$ we define $w_{t+1}$ by

$$w_{t+1} = w_t - a_t Y_t, \qquad (11)$$

where $a_t > 0$ for all $t$, $\sum_t a_t = \infty$ and $\sum_t a_t^2 < \infty$. The $Y_t$ are assumed to be drawn from a family $Y(x)$ of random vectors defined for all $x \in \mathbb{R}^k$, and $Y_t$ is distributed as $Y(w_t)$. To ensure that $w_t$ converges to $w^*$ almost surely it is sufficient to assume that there are finite constants $A$ and $B$ such that $\mathbb{E}\|Y(x)\|^2 \le A + B\|x - w^*\|^2$ for all $x$, and that for all $\varepsilon > 0$

$$\inf\{(x - w^*)^\top f(x) : \varepsilon < \|x - w^*\| < \varepsilon^{-1}\} > 0. \qquad (12)$$

For instance, strictly convex functions satisfy (12). This classical convergence result is implied by the following theorem on almost positive supermartingales, which we will also use to prove our results:



Theorem 3.1 (Robbins and Siegmund, 1971)

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$. Let $U_t$, $\beta_t$, $\xi_t$ and $\zeta_t$, $t = 1, 2, \ldots$ be non-negative $\mathcal{F}_t$-measurable random variables such that

$$\mathbb{E}(U_{t+1} \mid \mathcal{F}_t) \le (1 + \beta_t) U_t + \xi_t - \zeta_t, \quad t = 1, 2, \ldots \qquad (13)$$

Then on the set $\{\sum_t \beta_t < \infty, \sum_t \xi_t < \infty\}$, $U_t$ converges almost surely to a random variable, and $\sum_t \zeta_t < \infty$ almost surely.
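For intuition, the Robbins-Monro iteration (11) of Section 3.1 can be sketched in one dimension. The target function $f(w) = w - 2$, the Gaussian noise model, and the function names are assumptions made for this illustration; the gains $a_t = 1/t$ are one standard choice with $\sum_t a_t = \infty$ and $\sum_t a_t^2 < \infty$.

```python
import numpy as np

def robbins_monro(noisy_f, w1, steps, rng):
    """Iterate w_{t+1} = w_t - a_t * Y_t with gains a_t = 1/t."""
    w = w1
    for t in range(1, steps + 1):
        w = w - (1.0 / t) * noisy_f(w, rng)
    return w

def noisy_f(w, rng):
    # Hypothetical example: f(w) = w - 2 (root w* = 2), observed with
    # additive unit-variance Gaussian noise, so the estimate is unbiased.
    return (w - 2.0) + rng.normal()

w_final = robbins_monro(noisy_f, w1=10.0, steps=10_000,
                        rng=np.random.default_rng(0))
```

Despite every observation being noisy, the decaying gains average the noise out and the iterate settles near the root.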

3.2 Convergence for Updates with Scaling 
Matrices 



Bottou and LeCun (2005) previously presented results on the rate of convergence in parameter space (using the Euclidean norm) of online updates with convergent scaling matrices, and remark that bounds on the eigenvalues of the scaling matrix will be essential to extending convergence guarantees beyond that. Since recent online quasi-Newton methods (Schraudolph et al., 2007) do not provide convergence of their scaling matrices, we now employ the Robbins-Siegmund theorem to prove that such updates are still guaranteed almost sure convergence, provided the spectrum of their (possibly non-convergent) scaling matrices is uniformly bounded from above by a finite constant and from below by a strictly positive constant.

Theorem 3.2 Let $C : \mathbb{R}^k \to \mathbb{R}$ be a twice differentiable cost function with unique minimum $w^*$, and let

$$w_{t+1} = w_t - a_t B_t Y_t, \qquad (14)$$

where $B_t$ is symmetric and only depends on information available at time $t$.

Then $w_t$ converges to $w^*$ almost surely if the following conditions hold:

C.1 $(\forall t)\ \mathbb{E}_t Y_t = \nabla_w C(w_t)$;

C.2 $(\exists K)(\forall w)\ \|\nabla_w^2 C(w)\| \le 2K$;

C.3 $(\forall \delta > 0)\ \inf_{C(w) - C(w^*) > \delta} \|\nabla_w C(w)\| > 0$;

C.4 $(\exists A, B)(\forall t)\ \mathbb{E}_t \|Y_t\|^2 \le A + B\, C(w_t)$;

C.5 $(\exists m, M : 0 < m < M < \infty)(\forall t)\ mI \preceq B_t \preceq MI$, where $I$ is the identity matrix;

C.6 $\sum_t a_t^2 < \infty$ and $\sum_t a_t = \infty$.

Proof Since $C$ is twice differentiable and has bounded Hessian (C.2) we can use Taylor expansion and the upper eigenvalue bound (C.5) to prove that

$$C(w_{t+1}) = C(w_t - a_t B_t Y_t) \le C(w_t) - a_t [\nabla_w C(w_t)]^\top B_t Y_t + K M^2 a_t^2 \|Y_t\|^2, \qquad (15)$$

which implies, using (C.1) and (C.4), that

$$\mathbb{E}_t C(w_{t+1}) \le C(w_t) + K M^2 a_t^2 [A + B\, C(w_t)] - a_t [\nabla_w C(w_t)]^\top B_t \nabla_w C(w_t). \qquad (16)$$

If we let $U_t = C(w_t)$, merge the terms containing $U_t$, and apply the lower eigenvalue bound (C.5), it follows that

$$\mathbb{E}_t U_{t+1} \le U_t (1 + a_t^2 B K M^2) + A K M^2 a_t^2 - m a_t \|\nabla_w C(w_t)\|^2. \qquad (17)$$

Since $\sum_t a_t^2 < \infty$ (C.6), the Robbins-Siegmund theorem can now be applied. We find that

$$\sum_t a_t \|\nabla_w C(w_t)\|^2 < \infty. \qquad (18)$$

Since $\sum_t a_t = \infty$ (C.6) it follows from (C.3) that

$$\|\nabla_w C(w_t)\|^2 \to 0 \qquad (19)$$

and that $C(w_t) \to C(w^*)$ as $t \to \infty$. ■

Remark 3.3 The assumption that $C$ is twice differentiable is only needed for the Taylor expansion we use to obtain (15). If we have such a property from elsewhere we do not need twice-differentiability.

3.3 Asymptotic Rates

Consider the situation described in Theorem 3.2. We now strengthen assumption (C.3), which demands that the function $C$ is not so flat around the minimum that we may never approach it, to instead assuming that

$$\frac{C(w_t) - C(w^*)}{\|\nabla C(w_t)\|^2} \le D < \infty \quad \forall t. \qquad (20)$$

Condition (20) is implied by strong convexity if we know that the $w_t$ tend to the optimum $w^*$. Since $\nabla C(w^*) = 0$ we can use first-order Taylor expansion of $\nabla C$ around $w^*$ to approximate $\|\nabla C(w)\|^2$ by $(w - w^*)^\top [\nabla^2 C(w^*)]^2 (w - w^*)$.

We will also modify assumption (C.4) by setting $B = 0$. Theorem 3.2 guarantees (under the weaker conditions) that the procedure will almost surely generate a converging sequence which is therefore contained in some ball around $w^*$. This makes the new condition reasonable. Bottou (1998) contains a more elaborate discussion on what is there called "Global Confinement".

We need a result on what the expected improvement is, given the step size $a_t$, the uniform bound on the Hessian, and the uniform eigenvalue bounds. The key to achieving this is (15). This section's counterpart to (16) under the new conditions is



$$\mathbb{E}_t C(w_{t+1}) - C(w^*) \le [C(w_t) - C(w^*)](1 - a_t m / D) + A K M^2 a_t^2. \qquad (21)$$

We want to know the rate of the sequence $\mathbb{E} C(w_t) - \inf_w C(w)$, and will use the fact that taking the unconditional expectation of (21) yields the result

$$\mathbb{E} C(w_{t+1}) - C(w^*) \le [\mathbb{E} C(w_t) - C(w^*)](1 - a_t m / D) + A K M^2 a_t^2. \qquad (22)$$

We are now in a position to state our result:



Theorem 3.4 Let $C : \mathbb{R}^k \to \mathbb{R}$ be a twice differentiable cost function with unique minimum $w^*$. Assume that

1. Conditions C.1-C.6 from Theorem 3.2 hold with $B = 0$ in C.4.

2. Equation (20) holds.

3. $a_t = \frac{\tau}{t}$ with $\tau > D/m$.

Then $\mathbb{E} C(w_t) - \inf_w C(w)$ is equivalent to $\frac{1}{t}$ as $t \to \infty$.



Proof Bottou and LeCun (2005, A.4) state: if $u_t = [1 - \frac{\alpha}{t} + o(1/t)]\, u_{t-1} + \frac{\beta}{t^2} + o(1/t^2)$ with $\alpha > 1$ and $\beta > 0$, then $t u_t \to \frac{\beta}{\alpha - 1}$. Theorem 3.4 follows from (22) by setting $u_t = \mathbb{E} C(w_t) - C(w^*)$. ■
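A small simulation can make the setting of Theorems 3.2 and 3.4 concrete. The sketch below is an illustration under assumptions, not an experiment from the paper: the objective $C(w) = \frac{1}{2}\|w\|^2$, the additive Gaussian gradient noise, and the alternating diagonal scaling matrices are all hypothetical choices. The matrices never converge, but their eigenvalues stay in $[m, M]$.

```python
import numpy as np

def variable_metric_sgd(T, tau, m, M, sigma, rng):
    """Run w_{t+1} = w_t - a_t B_t Y_t on C(w) = 0.5*||w||^2 with a_t = tau/t."""
    w = np.array([1.0, 1.0])
    for t in range(1, T + 1):
        # Non-convergent scaling matrices with eigenvalues in [m, M].
        B = np.diag([m, M]) if t % 2 == 0 else np.diag([M, m])
        Y = w + sigma * rng.normal(size=2)   # unbiased estimate of grad C(w) = w
        w = w - (tau / t) * (B @ Y)
    return 0.5 * (w @ w)                     # C(w_T) - C(w*), since C(w*) = 0

# For this C, (C(w) - C(w*)) / ||grad C(w)||^2 = 1/2, so D = 0.5,
# and tau = 2 satisfies tau > D/m = 1.
gap = variable_metric_sgd(T=20_000, tau=2.0, m=0.5, M=2.0, sigma=1.0,
                          rng=np.random.default_rng(0))
```

Repeating the run for several horizons `T` and averaging over seeds would give an empirical view of how the expected gap shrinks with $t$; a single run only shows that the iterate ends close to the minimum.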



4 Quasi-Newton Methods 

Quasi-Newton methods are optimization methods with updates of the form $w_{t+1} = w_t - a_t B_t \nabla_w C(w_t)$, where $a_t > 0$ is a scalar gain (typically set by a line search) and $B_t$ a positive-semidefinite scaling matrix. If $B_t = I$, we have simple gradient descent; setting $B_t$ to the inverse Hessian of $C(w)$ we recover Newton's method. Inverting the $k \times k$ Hessian is a computationally challenging task if $k$ is large; quasi-Newton methods reduce the computational cost by incrementally maintaining a symmetric positive-definite estimate $B_t$ of the inverse Hessian of the objective function.



4.1 (L)BFGS 



The BFGS algorithm (Nocedal and Wright, 1999) was developed independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970. It incrementally maintains its estimate $B_t$ of the inverse Hessian via a rank-two update that minimizes a weighted Frobenius norm $\|B_{t+1} - B_t\|_W$ subject to the secant equation $s_t = B_{t+1} y_t$, where $s_t := w_{t+1} - w_t$ and $y_t := \nabla_w C(w_{t+1}) - \nabla_w C(w_t)$ denote the most recent step along the optimization trajectory in parameter and gradient space, respectively. LBFGS is a limited-memory (matrix-free) version of BFGS.
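The rank-two update can be sketched directly in its standard textbook form for the inverse-Hessian estimate. The function name and the quadratic test problem are assumptions for illustration; the closed-form update itself is the usual BFGS formula.

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """Standard BFGS update of the inverse-Hessian estimate B; needs s @ y > 0."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)

# Hypothetical quadratic C(w) = 0.5 * w^T H w, for which y = H s exactly.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([0.5, -0.2])             # step in parameter space
y = H @ s                             # corresponding change in the gradient
B_next = bfgs_inverse_update(np.eye(2), s, y)
```

By construction the new estimate satisfies the secant equation $s_t = B_{t+1} y_t$, and it stays symmetric positive definite whenever $s_t^\top y_t > 0$.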



4.2 Online (L)BFGS 

Recently developed online variants of BFGS and LBFGS, called oBFGS and oLBFGS respectively, are amenable to stochastic approximation of the gradient (Schraudolph et al., 2007). The key differences between the online and batch algorithms can be summarized as follows:

The gradient of the objective is estimated from small samples (mini-batches) of data. The difference $y_t$ of gradients is computed with gradients for the same data sample, i.e., for the same function $C(\cdot, z_t)$. Line search is replaced by a gain sequence $a_t$. A trust region parameter $\lambda$ is introduced, modifying the algorithm to estimate $(H_t + \lambda I)^{-1}$, where $H_t$ is the Hessian at iteration $t$; this prevents the largest eigenvalue of $B_t$ from exceeding $\lambda^{-1}$.



See Schraudolph et al. (2007) for a more detailed description of the oLBFGS and oBFGS algorithms, and for experimental results on quadratic bowl objectives and conditional random fields (CRFs), which are instances of risk minimization.



Remark 4.1 To guarantee a uniform lower eigenvalue bound for the updates of Schraudolph et al. (2007) we would have to use $B_t + \gamma I$ for some $\gamma > 0$, effectively interpolating between o(L)BFGS as defined by Schraudolph et al. (2007) and simple online gradient descent. This lower bound is not needed for convergence per se but to prove that the convergence is to the minimum.
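The two eigenvalue bounds can be illustrated together. The sketch below assumes, for simplicity, that the scaling matrix equals its trust-region target $(H_t + \lambda I)^{-1}$ exactly, which an actual oBFGS estimate only approximates; the function name and the example Hessian are hypothetical.

```python
import numpy as np

def regularized_scaling(H, lam, gamma):
    """(H + lam*I)^{-1} + gamma*I; for PSD H its eigenvalues lie in
    [gamma, gamma + 1/lam]."""
    I = np.eye(len(H))
    return np.linalg.inv(H + lam * I) + gamma * I

H = np.diag([0.0, 1.0, 1e6])          # hypothetical badly conditioned PSD Hessian
B = regularized_scaling(H, lam=0.1, gamma=0.01)
eigs = np.linalg.eigvalsh(B)          # eigenvalues in ascending order
```

Here $\lambda$ caps the largest eigenvalue and $\gamma$ lifts the smallest one away from zero, which is the kind of uniform two-sided bound that condition C.5 asks for.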

4.3 Filtering 



Kailath et al. (2000, Chapter 14) present assumptions in control theory that imply either convergence of the matrices $B_t$ or upper and lower eigenvalue bounds. The relevant control theory is too extensive and complicated to be reviewed here and we therefore only point out the connection.

5 Expert Advice with Bregman 
Divergences 

We now compare the updates in the expert advice framework by Azoury and Warmuth (2001) to the SGD updates that would result from a non-linear reparametrization.



As outlined in Section 2.3, the difference between the update

$$w_{t+1} = h^{-1}\big(h(w_t) - a_t \nabla L_t(w_t)\big) \qquad (23)$$

and performing stochastic gradient descent with respect to a new variable $\gamma = h(w)$ is that the latter would include matrices $B_t$ which are the inverse Jacobians of $h$ at $w_t$, i.e., the inverse Hessian of $H$ where $\nabla H = h$. The update $\gamma_{t+1} = \gamma_t - a_t \nabla_\gamma \tilde{L}_t(\gamma_t)$, where $\tilde{L}_t(\gamma) = L_t(h^{-1}(\gamma))$, expressed with respect to $w$ looks as follows:

$$w_{t+1} = h^{-1}\big(h(w_t) - a_t B_t \nabla L_t(w_t)\big). \qquad (24)$$

Therefore (23) can be written as

$$\gamma_{t+1} = \gamma_t - a_t B_t^{-1} \nabla_\gamma \tilde{L}_t(\gamma_t). \qquad (25)$$



The main point of the reparametrization would be to 
change variables so that SGD converges with an op- 
timal rate for the new objectives. We would like the 
new objective to be approximately quadratic. Assum- 
ing that we have chosen the transfer function in such a 
manner that our main theorems apply to the new ob- 
jective function, it remains to check the scaling matrix 
condition. 



To satisfy the conditions for the scaling matrices we need the Jacobian of the transfer function to have upper and lower eigenvalue bounds. In the literature in which these methods are studied, an assumption of uniformly bounded gradients $\nabla L_t(w_t)$ (Azoury and Warmuth, 2001) is often used, or the parameters are restricted to a compact set (Zinkevich, 2003). In that case the conditions on scaling matrices become easier to satisfy: in any compact set, popular functions like $e^x$, $(1 + e^{-x})^{-1}$, or other sigmoid functions (though not, e.g., $x^3$ which is flat at the origin) have derivatives that are bounded from above, and from below by a strictly positive constant.
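The contrast between a sigmoid and a cubic transfer function can be checked numerically on a compact interval. The helper below, its grid resolution, and the interval $[-5, 5]$ are assumptions for illustration only.

```python
import numpy as np

def derivative_bounds(dfun, radius, n=10_001):
    """Smallest and largest value of a derivative on a grid over [-radius, radius]."""
    x = np.linspace(-radius, radius, n)
    d = dfun(x)
    return d.min(), d.max()

def dsigmoid(x):
    # Derivative of the logistic function (1 + exp(-x))^{-1}.
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

m_sig, M_sig = derivative_bounds(dsigmoid, radius=5.0)
m_cub, M_cub = derivative_bounds(lambda x: 3.0 * x**2, radius=5.0)
```

On the interval the logistic derivative stays strictly positive and below $1/4$, whereas the derivative of the cubic reaches zero at the origin, so no strictly positive lower bound exists for it.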



6 Conclusion 

We provide a variable metric stochastic approximation 
theory which implies convergence for stochastic gra- 
dient descent even when the gradients are calculated 
with respect to variable metrics. Metrics are some- 
times changed for optimization purposes, as in the 
case of using quasi-Newton methods. Our main the- 
orems imply convergence results for online versions of 
the BFGS and LBFGS optimization methods. Kalman 
filters are a class of well-known algorithms that can be 
viewed as an online Newton method for the special 
case of square losses, since the procedure at every step 
performs a gradient step where the gradient is defined 
using the Hessian of the loss for the examples seen so 
far. Finally we investigate the task of learning from ex- 
pert advice where Bregman divergences are frequently 
used to achieve updates that are suitable for the task at 
hand. We interpret the resulting updates as stochastic 
gradient descent in a space that has undergone a non- 
linear reparametrization, and where we use different 
metrics depending on the point we are at. 